{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get Started\n",
    "\n",
    "*by [Longqi@Cornell](http://www.cs.cornell.edu/~ylongqi/) licensed under [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)*\n",
    "\n",
    "This tutorial demonstrates the process of training and evaluating recommendation algorithms using OpenRec (tutorial on implementing new recommendation algorithm: [tutorial]()):\n",
    " * Prepare training and evaluation datasets.\n",
    " * Instantiate a recommender.\n",
    " * Instantiate a sampler.\n",
    " * Instantiate evaluators.\n",
    " * Instantiate a model trainer.\n",
    " * TRAIN AND EVALUATE!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare training and evaluation datasets\n",
    "\n",
    "* Download your favorite dataset from the web. In this tutorial, we use [a relatively small citeulike dataset](http://www.wanghao.in/CDL.htm) for demonstration purpose (It requires `unrar` package to unpack the data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "try:\n",
    "    from urllib.request import urlretrieve\n",
    "except ImportError:\n",
    "    from urllib import urlretrieve\n",
    "\n",
    "urlretrieve('http://www.wanghao.in/data/ctrsr_datasets.rar', 'ctrsr_datasets.rar')\n",
    "os.system('unrar x ctrsr_datasets.rar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Convert raw data into [numpy structured array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). As required by the **ImplicitDataset** class, two keys `user_id` and `item_id` are required. Each row in the converted numpy array represents an interaction. The array might contain additional keys based on the use cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "\n",
    "users_count = 0\n",
    "interactions_count = 0\n",
    "with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:\n",
    "    for line in fin:\n",
    "        interactions_count += int(line.split()[0])\n",
    "        users_count += 1\n",
    "\n",
    "# radomly hold out an item per user for validation and testing respectively.\n",
    "val_structured_arr = np.zeros(users_count, dtype=[('user_id', np.int32), ('item_id', np.int32)]) \n",
    "test_structured_arr = np.zeros(users_count, dtype=[('user_id', np.int32), ('item_id', np.int32)])\n",
    "train_structured_arr = np.zeros(interactions_count-11102, dtype=[('user_id', np.int32), ('item_id', np.int32)])\n",
    "\n",
    "interaction_ind = 0\n",
    "next_user_id = 0\n",
    "next_item_id = 0\n",
    "map_to_item_id = dict()  # Map item id from 0 to len(items)-1\n",
    "\n",
    "with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:\n",
    "    for line in fin:\n",
    "        item_list = line.split()[1:]\n",
    "        random.shuffle(item_list)\n",
    "        for ind, item in enumerate(item_list):\n",
    "            if item not in map_to_item_id:\n",
    "                map_to_item_id[item] = next_item_id\n",
    "                next_item_id += 1\n",
    "            if ind == 0:\n",
    "                val_structured_arr[next_user_id] = (next_user_id, map_to_item_id[item])\n",
    "            elif ind == 1:\n",
    "                test_structured_arr[next_user_id] = (next_user_id, map_to_item_id[item])\n",
    "            else:\n",
    "                train_structured_arr[interaction_ind] = (next_user_id, map_to_item_id[item])\n",
    "                interaction_ind += 1\n",
    "        next_user_id += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Instantiate training, validation, and testing datasets. As the data is from users' implicit feedback, we choose the **ImplicitDataset** class, as opposed to the general **Dataset** class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openrec.tf1.legacy.utils import ImplicitDataset\n",
    "\n",
    "train_dataset = ImplicitDataset(raw_data=train_structured_arr, \n",
    "                        max_user=users_count, \n",
    "                        max_item=len(map_to_item_id), name='Train')\n",
    "val_dataset = ImplicitDataset(raw_data=val_structured_arr, \n",
    "                      max_user=users_count,\n",
    "                      max_item=len(map_to_item_id), name='Val')\n",
    "test_dataset = ImplicitDataset(raw_data=test_structured_arr, \n",
    "                       max_user=users_count,\n",
    "                       max_item=len(map_to_item_id), name='Test')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Instantiate a recommender\n",
    "\n",
    "We use the [BPR recommender](http://openrec.tf1.readthedocs.io/en/latest/recommenders/openrec.tf1.recommenders.bpr.html) that implements the pure Baysian Personalized Ranking (BPR) algorithm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openrec.tf1.legacy.recommenders import BPR\n",
    "\n",
    "bpr_model = BPR(batch_size=1000, \n",
    "                max_user=train_dataset.max_user(), \n",
    "                max_item=train_dataset.max_item(), \n",
    "                dim_embed=20, \n",
    "                opt='Adam')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Instantiate a sampler\n",
    "\n",
    "A basic [pairwise sampler](http://openrec.tf1.readthedocs.io/en/latest/utils/openrec.tf1.utils.samplers.html) is used, i.e., each instance contains an user, an item that the user interacts, and an item that the user did NOT interact. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openrec.tf1.legacy.utils.samplers import PairwiseSampler\n",
    "\n",
    "sampler = PairwiseSampler(batch_size=1000, dataset=train_dataset, num_process=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Instantiate evaluators\n",
    "\n",
    "Define evaluators that you plan to use. This tutorial evaluate the recommender against Area Under Curve (AUC)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openrec.tf1.legacy.utils.evaluators import AUC\n",
    "\n",
    "auc_evaluator = AUC()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Instantiate a model trainer\n",
    "\n",
    "The **implicit model trainer** drives the training and evaluation of the recommender using defined *implicit feedback datasets*, sampler, model, and evaluators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openrec.tf1.legacy import ImplicitModelTrainer\n",
    "\n",
    "model_trainer = ImplicitModelTrainer(batch_size=1000, \n",
    "                             test_batch_size=100, \n",
    "                            train_dataset=train_dataset, \n",
    "                             model=bpr_model, \n",
    "                             sampler=sampler)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TRAIN AND EVALUATE!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_trainer.train(num_itr=10000, \n",
    "                    display_itr=1000, \n",
    "                    eval_datasets=[val_dataset, test_dataset],\n",
    "                    evaluators=[auc_evaluator])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
